Learning Text Extraction Rules, without Ignoring Stop Words
Authors
Abstract
Information Extraction (IE) from text/web documents has become an important application area of AI. As the number of web sites and documents has grown dramatically, users need easy, fast, and flexible ways of generating systems that can carry out specific IE tasks. This can be achieved with the help of Machine Learning (ML) techniques. We have developed a system that exploits this strategy. After training, the system is capable of identifying certain relevant elements in the text and extracting the corresponding information. As input, the system takes a collection of text documents (in a certain domain) that have been previously annotated by a user. These annotated documents are used to generate extraction rules. We describe a set of experiments oriented towards the domain of announcements (in Portuguese) concerning house/flat sales. We show that quite good overall results can be achieved using this methodology. In previous work, some authors argue that stop words should be eliminated before training. We have decided to reexamine this assumption and present evidence that stop words can be quite useful in some sub-tasks.
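The point about stop words can be illustrated with a toy sketch. This is not the system described in the abstract; it is a minimal example that assumes a very simple rule form (the token immediately to the left of an annotated value acts as a trigger). The stop-word list, the sample announcements, the choice of "location" as the annotated element, and all function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, tiny Portuguese stop-word list (illustration only).
STOP_WORDS = {"em", "de", "com", "por", "a", "o"}

def tokenize(text, drop_stop_words=False):
    """Lowercase the text and split it into word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

def learn_left_context(annotated, drop_stop_words=False):
    """Collect the token that immediately precedes each annotated value."""
    triggers = set()
    for text, value in annotated:
        tokens = tokenize(text, drop_stop_words)
        target = value.lower()
        for i, tok in enumerate(tokens):
            if tok == target and i > 0:
                triggers.add(tokens[i - 1])
    return triggers

def extract(text, triggers, drop_stop_words=False):
    """Return every token that directly follows a learned trigger token."""
    tokens = tokenize(text, drop_stop_words)
    return [tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] in triggers]

# Hypothetical annotated announcements: (text, annotated location).
ANNOTATED = [
    ("Vende-se apartamento T3 em Lisboa", "Lisboa"),
    ("Vendo moradia em Braga", "Braga"),
]
NEW_AD = "Apartamento T2 em Faro por 100000 euros"

# Keeping stop words: the preposition "em" is learned as the trigger.
rules_with = learn_left_context(ANNOTATED)
found_with = extract(NEW_AD, rules_with)

# Dropping stop words: the triggers become incidental content words.
rules_without = learn_left_context(ANNOTATED, drop_stop_words=True)
found_without = extract(NEW_AD, rules_without, drop_stop_words=True)

print(rules_with, found_with)
print(rules_without, found_without)
```

In this toy setting, the rule learned with stop words kept generalizes to the unseen announcement because the preposition "em" reliably precedes the location, whereas the rules learned after stop-word removal depend on words that happen to differ between announcements.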
Similar resources
A New Text Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...
CSCR010: Second Year Report
My PhD research focuses on Text Mining, one major research school in Knowledge Discovery in Databases (KDD), and in particular on Text Preprocessing (TPP) for the classification/categorization of documents, utilizing novel algorithms for the identification of hidden patterns, rules, regularities and trends within these documents. Significant techniques in Data Mining, another well-known ...
Extracting information for biology
as GeneRIF annotation, 2/3 of the systems could be outperformed. 2.1.4 Summary: information retrieval. Methods: document indexing in combination with statistical methods (bag-of-words approach). Query: any set of words is treated as a bag of words. Domain: mostly unrestricted. Result: set of documents or passages. Interpretation of results: the user interprets the retrieved documents. Evaluation: clear: Preci...
Learning for Text Categorization and Information Extraction with ILP
Text Categorization (TC) and Information Extraction (IE) are two important goals of Natural Language Processing. While handcrafting rules for both tasks has a long tradition, learning approaches gained much interest in the past. In the present paper we try to provide a solid basis for the application of ILP methods to these learning problems. We propose to introduce three basic types (namely a ...
Improving Precision of Keywords Extracted From Persian Text Using Word2Vec Algorithm
Keywords can present the main concepts of the text without human intervention according to the model. Keywords are important vocabulary words that describe the text and play a very important role in accurate and fast understanding of the content. The purpose of extracting keywords is to identify the subject of the text and the main content of the text in the shortest time. Keyword extraction pl...